suppressPackageStartupMessages({
library(ArchR)
library(parallel)
library(tidyverse)
library(SingleCellExperiment)
library(zellkonverter)
library(dtwclust)
})
### VARIABLES
proj_name <- "/omics/groups/OE0533/internal/katharina/scDoRI/gastrulation_data_new/Kathi/04_p2g_links/"
p2g_links_file <- "/omics/groups/OE0533/internal/katharina/scDoRI/gastrulation_data_new/Kathi/p2g_links"
# location of seurat file
seurat_file <- "/omics/groups/OE0533/internal/katharina/scDoRI/gastrulation_data_new/Kathi/rna_seurat"
seacell_csv_file <- "/omics/groups/OE0533/internal/katharina/scDoRI/gastrulation_data_new/Kathi/perturbation_sea_df_new"
# which GeneScoreMatrix to use, needs to be in getAvailableMatrices(proj)
GeneScoreMatrix <- "GeneScoreMatrix_geneBody"
proj <- loadArchRProject(proj_name, showLogo = FALSE)
## Successfully loaded ArchRProject!
In order to get gene regulatory links, one can compute a simple correlation between accessible peaks and gene expression, in order to find peaks whose activity is highly correlated with expression of a certain gene. These are potential enhancers of a gene. Since single cell data is very sparse, the common way to compute correlatiosn is by aggregating accessibility and gene expression data across cell aggregates. In ArchR this is done by sampling 500 cells from the entire dataset and finding the 50 nearest neighbors of these cells. These cell aggregates therefore reprsent groups of similar cells and can be used to compute correlations.
Notably, for computing links within a certain distance on the chromosome, ArchR does not take into consideration the strand orientation, but computes the distance between the “start” coordinate and the peak middle coordinate. However, on the minus strand the TSS is the “end” coordinate. For the correlations this is not important, but in my computations I will use the TSS coordinate in a strand-aware fashion.
Based on these putative peak-to-gene links, it is possible to compute gene activity scores. If these scores recapitulate gene expression well, this is a validation of the links. Nevertheless, it is expected that gene activity scores correlate highly with gene expression, since the peaks used for the computation are highly correlated with the genes per definition. In the following you will find a function to compute gene activity scores from peak-to-gene links, adpated from Cicero, where gene activit scores are computed from co-accessible peaks.
Furthermore, since peaks which are farther away from a gene on the genome are less likely to regulate this gene, it is common to use distance weigths to penalize peaks which are highly correlated, but distant. However, I observed that using distance weigths decreases the correlation with gene expression, with less steep decay rates resulting in better gene activity scores. This shows that the distance weights lead to a lot of zero values when multiplied with the correlation values. Still, biologically very distant peaks are probably not correlated with a promoter, because they are intereacting, but more likely, are arbitrary correlations. Finding a good trade-off between considering correlations across large distances, but also using prior knowledge about the biology and restricting the links to a certain window is non-trivial. The correct distance decay rate probably depends on each individual gene and might differ across celltypes.
Yet another approach would be to use only peaks, which are within +/- 100kb of the TSS of a gene, thereby, removing any peaks which are far away. This is similar to the approach in ArchR. Here, the gene activity scores for each gene are computed based on all peaks which are within +/- 100bp of the TSS of the gene. As you will see in the following plots using this approach leads to very high correlations between gene expression and gene activity scores. Computing the scores based on peak-to-gene links offers only a minor improvement.
The main purpose of the entire excercise was to be able to compare the goodness of these links to links obtained using scDoRi. This relationship will have to be explored further.
p2g <- readRDS(p2g_links_file)
Read in the peak accessibility matrix and the gene expression matrix:
# get peak matrix
peaks <- getMatrixFromProject(proj, useMatrix = "PeakMatrix", binarize = FALSE)
PEAK_MAT <- assays(peaks)[[1]]
#saveRDS(peaks, "Kathi/peaks_archr")
#saveRDS(PEAK_MAT, "Kathi/peak_mat_archr")
# read in gne expresssion matrix
gene_expr <- getMatrixFromProject(proj,
useMatrix = "GeneExpressionMatrix")
EXPR_MAT <- assays(gene_expr)[[1]]
rownames(EXPR_MAT) <- rowData(gene_expr)$name
GENE_ANNO <- rowData(gene_expr)
gene_colData <- colData(gene_expr)
#saveRDS(gene_expr, "gene_expr_archr.rds")
# read in archr gene activity scores
archr_scores <- getMatrixFromProject(proj, useMatrix = GeneScoreMatrix)
#saveRDS(archr_scores, "gene_activity_scores_gene_body.rds")
# change celltype column name
cp_names <- colnames(colData(archr_scores))
cp_names[20] <- "celltypes"
colnames(colData(archr_scores)) <- cp_names
# extract gene activity scores from ArchR
archr_scores_mat <- assays(archr_scores)[[1]]
rownames(archr_scores_mat) <- rowData(archr_scores)$name
# get metadata of gene activity scores
archr_colData <- colData(archr_scores)
rm(peaks)
rm(gene_expr)
rm(archr_scores)
gc(reset = TRUE)
CHECK IF THE WHOLE THING RUNS WITHOUT REMOVING NANs
PEAK_MAT <- readRDS("Kathi/peak_mat_archr")
#PEAK_MAT <- PEAK_MAT[, 1:1000]
gene_expr <- readRDS("Kathi/gene_expr_archr.rds")
EXPR_MAT <- assays(gene_expr)[[1]]
rownames(EXPR_MAT) <- rowData(gene_expr)$name
GENE_ANNO <- rowData(gene_expr)
gene_colData <- colData(gene_expr) %>% as.data.frame()
gene_colData <- dplyr::rename(gene_colData, celltypes = celltype.mapped)
#EXPR_MAT <- EXPR_MAT[, 1:1000]
It is important for the functions below that the column which contains the celltypes is called “celltypes”. Here we rename the column from “celltype.mapped” to “celltypes”.
archr_scores <- readRDS("Kathi/gene_activity_scores_gene_body.rds")
#saveRDS(archr_scores, "gene_activity_scores_gene_body.rds")
# change celltype column name
# cp_names <- colnames(colData(archr_scores))
# cp_names[celltype.mapped] <- "celltypes"
# colnames(colData(archr_scores)) <- cp_names
# extract gene activity scores from ArchR
archr_scores_mat <- assays(archr_scores)[[1]]
rownames(archr_scores_mat) <- rowData(archr_scores)$name
# get metadata of gene activity scores
archr_colData <- colData(archr_scores) %>% as.data.frame()
archr_colData <- dplyr::rename(archr_colData, celltypes = celltype.mapped )
#archr_scores_mat <- archr_scores_mat[, 1:1000]
archr_colData <- archr_colData %>% rownames_to_column("cell") %>%
dplyr::filter(cell %in% colnames(archr_scores_mat)) %>%
column_to_rownames("cell")
rm(archr_scores)
rm(gene_expr)
gc(reset = TRUE)
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 6713402 358.6 12220180 652.7 6713402 358.6
## Vcells 5317636143 40570.4 6523077770 49767.2 5317636143 40570.4
We will only use peaks linked to highly variable genes to compute gene activity scores.
# read in list of highly variable genes
hvg_list <- readRDS("Kathi/hvg.rds")
# get RNA index of hvg
meta_rna <- GENE_ANNO %>% as.data.frame() %>% mutate(row_index = seq(nrow(.)))
idx <- (meta_rna %>% dplyr::filter(name %in% hvg_list))$row_index
expr_sub <- EXPR_MAT[idx, ]
seacells <- read.csv(seacell_csv_file)
seacells <- seacells %>% dplyr::filter(index %in% colnames(EXPR_MAT))
#seacells <- drop_na(seacells)
links <- p2g %>% as.data.frame() %>%
dplyr::filter(Correlation > 0.2) %>%
dplyr::filter(idxRNA %in% idx)
stopifnot(all(links$Correlation > 0))
Create a p2g link matrix
p2g_mat <- sparseMatrix(i = links$idxRNA,
j = links$idxATAC,
x= links$Correlation,
dims = c(dim(EXPR_MAT)[1],
dim(PEAK_MAT)[1]))
rownames(p2g_mat) <- GENE_ANNO$name
rownames(PEAK_MAT) <- seq.int(dim(PEAK_MAT)[1])
colnames(p2g_mat) <- seq.int(dim(PEAK_MAT)[1])
Filter and prepare peak matrix and p2g links matrix:
# remove columns of peaks which are not linked to any peak
p2g_mat_sub <- p2g_mat[, colSums(p2g_mat) != 0]
# use only highly variable genes
p2g_mat_sub <- p2g_mat_sub[hvg_list, ]
# remove any genes which are not linked to any peak
p2g_mat_sub <- p2g_mat_sub[rowSums(p2g_mat_sub) != 0, ]
stopifnot(all(rownames(p2g_mat_sub) %in% hvg_list))
stopifnot(any(is.na(p2g_mat_sub) == FALSE))
# keep only peaks which are linked to genes in the accessibility matrix
peak_mat_sub <- PEAK_MAT[colnames(p2g_mat_sub), seacells$index]
stopifnot(rownames(peak_mat_sub) == colnames(p2g_mat_sub))
#stopifnot(any(is.na(peak_mat_sub) == FALSE))
stopifnot(dim(peak_mat_sub)[1] == dim(p2g_mat_sub)[2])
expr_mat_sub <- EXPR_MAT[as.vector(rownames(p2g_mat_sub)), seacells$index]
gene_colData <- gene_colData %>% rownames_to_column("cell") %>%
dplyr::filter(cell %in% colnames(expr_mat_sub)) %>%
column_to_rownames("cell")
archr_scores_sub <- archr_scores_mat[as.vector(rownames(expr_mat_sub)), seacells$index]
gene_activity_scores <- function(peak_mat, p2g_mat) {
#peak_mat_subset <- peak_mat[colnames(p2g_mat), ]
# normalize the p2g matrix by the total number of peaks linked to each gene
p2g_mat <- p2g_mat / rowSums(p2g_mat)
print(paste0("normalized the p2g matrix"))
stopifnot(any(is.na(p2g_mat)) == FALSE)
# Now we can compute a weighted sum of peak2gene correlations for each
# peak and gene
scores <- p2g_mat %*% peak_mat
print(paste0("Computed weightes sum of peaks for each gene and cell"))
# create a dataframe for computing the linear model
linear_model_df <- data.frame(cell = colnames(scores),
total_activity = colSums(scores),
total_sites = colSums(peak_mat))
# compute a linear model
activity_model <- stats::lm(log(total_activity) ~ log(total_sites),
data = linear_model_df)
# extract the fitted model
linear_model_df$fitted_curve <- exp(as.vector(predict(activity_model,
type = "response")))
# compute size factors from fitted model
size_factors <- mean(linear_model_df$fitted_curve) / linear_model_df$fitted_curve
# create diagonal matrix containing the size factors
size_factors_mat <- Matrix::Diagonal(x = size_factors)
#row.names(size_factors_mat) <- linear_model_df$cell
# normalize by library depth size factors
norm_scores <- Matrix::t(size_factors_mat %*% Matrix::t(scores))
print(paste0("Normalized for library size"))
# exponentiate, because RNA counts are log-normally distributed
norm_scores@x <- pmin(1e9, exp(norm_scores@x) - 1)
print(paste0("Exponentiated matrix"))
# free some memory
#rm(peak_mat_subset)
rm(activity_model)
rm(scores)
gc(reset = TRUE)
# scale with total activity scores again
scale_factors <- Matrix::Diagonal(x = 1/Matrix::colSums(norm_scores))
print(paste0("Divided by total activity to get value between zero and one"))
final_scores <- Matrix::t(scale_factors %*% Matrix::t(norm_scores))
return(final_scores)
}
p2g_scores <- gene_activity_scores(peak_mat_sub, p2g_mat_sub)
## [1] "normalized the p2g matrix"
## [1] "Computed weightes sum of peaks for each gene and cell"
## [1] "Normalized for library size"
## [1] "Exponentiated matrix"
## [1] "Divided by total activity to get value between zero and one"
saveRDS(p2g_scores, "Kathi/ArchR_p2g_based_gene_activity_scores.rds")
TODO: Add functions here, add ArchR aggregates and compute correlations as scatter & density plots
# the data matrix needs to be of dimension features x cells
# metadata of data matrix which contains celltypes
# needs to be called "celltypes"
create_celltype_aggregates <- function(metadata, data_matrix, celltypes) {
#create empty matrix to store aggregates
agg <- matrix(data = 0,
nrow = nrow(data_matrix),
ncol = length(celltypes),
dimnames = list(rownames(data_matrix), celltypes))
for (celltype in celltypes) {
barcodes <- rownames(metadata %>%
as.data.frame() %>%
dplyr::filter(celltypes == celltype))
if (length(barcodes) == 1){
agg[, celltype] <- data_matrix[, barcodes]
print(paste0("Warning! Celltype", celltype, "contains only one cell."))
} else{
agg[, celltype] <- rowSums(data_matrix[, barcodes])
}
}
stopifnot(any(is.na(agg)) == FALSE)
return(agg)
}
create_celltype_aggregates_p2g_scores <- function(gene_expr_metadata, p2g_score_matrix, celltypes) {
#create empty matrix to store aggregates
agg <- matrix(data = 0,
nrow = nrow(p2g_score_matrix),
ncol = length(celltypes),
dimnames = list(rownames(p2g_score_matrix), celltypes))
for (celltype in celltypes) {
barcodes <- rownames(gene_expr_metadata %>%
as.data.frame() %>%
dplyr::filter(celltypes == celltype))
if (length(barcodes) == 1) {
agg[, celltypes] <- p2g_score_matrix[, barcodes]
print(paste0("Warning! Celltype", celltype, "contains only one cell."))
} else {
agg[, celltype] <- rowSums(p2g_score_matrix[, barcodes])
}
}
stopifnot(any(is.na(agg)) == FALSE)
return(agg)
}
create_seacell_aggregates <- function(data_matrix, seacells_df){
agg <- matrix(data = 0,
nrow = nrow(data_matrix),
ncol = length(unique(seacells_df$SEACell)),
dimnames = list(rownames(data_matrix),
unique(seacells_df$SEACell)))
#stopifnot(nrow(agg) == nrow(data_matrix))
for (seacell in unique(seacells_df$SEACell)){
#print(seacell)
barcodes <- (seacells_df %>% dplyr::filter(SEACell == seacell))$index
#print(barcodes)
if (length(barcodes) == 1){
agg[, seacell] <- data_matrix[, barcodes]
print(paste0("Warning! SEACell", seacell, "contains only one cell."))
} else{
agg[, seacell] <- rowSums(data_matrix[, barcodes])
}
}
stopifnot(any(is.na(agg)) == FALSE)
return(agg)
}
ROW_CORR_THEME <- theme(axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
axis.title.x = element_text(size = 15), # change size of axis title
axis.title.y = element_text(size = 15),
plot.title = element_text(hjust = 0.5, size = 20),
legend.position = "None",
panel.grid.major = element_line(colour = "grey"), # Major grid lines
panel.background = element_rect(fill = "white", colour = "black"),
aspect.ratio = 0.4)
rowwise_correlations <- function(MatrixA, MatrixB, name) {
intersect_genes <- intersect(rownames(MatrixA), rownames(MatrixB))
MatrixA <- MatrixA[intersect_genes, ]
MatrixB <- MatrixB[intersect_genes, ]
correlations <- c()
for (i in seq.int(dim(MatrixA)[1])) {
rowA <- MatrixA[i, ]
rowA <- rowA - mean(rowA)
if (sd(rowA) != 0) {
rowA <- rowA / sd(rowA)
}
rowB <- MatrixB[i, ]
rowB <- rowB - mean(rowB)
if (sd(rowB) != 0){
rowB <- rowB / sd(rowB)
}
corr_value <- mean(rowA * rowB)
correlations <- c(correlations, corr_value)
}
names(correlations) <- rownames(MatrixA)
plot <- ggplot() + geom_histogram(aes(x = correlations),
bins = 200,
fill="#69b3a2") + labs(title = paste0(name)) +
ROW_CORR_THEME
return(list(correlations, plot))
}
Plot Theme for comparing correlations.
CORR_THEME <- theme(axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
axis.title.x = element_text(size = 15), # change size of axis title
axis.title.y = element_text(size = 15),
plot.title = element_text(hjust = 0.5, size = 20),
legend.position = "None",
panel.grid.major = element_line(colour = "grey"), # Major grid lines
panel.background = element_rect(fill = "white", colour = "black"),
aspect.ratio = 1)
Plot theme for plots which are mcuh wider than high.
THEME_LONG <- theme(axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
axis.title.x = element_text(size = 15), # change size of axis title
axis.title.y = element_text(size = 15),
plot.title = element_text(hjust = 0.5, size = 20),
legend.position = "None",
panel.grid.major = element_line(colour = "grey"), # Major grid lines
panel.background = element_rect(fill = "white", colour = "black"),
aspect.ratio = 0.2)
Plot theme for standard plots (slighly wider than high).
CUSTOM_THEME <- theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, size = 12),
axis.text.y = element_text(size = 12),
axis.title.x = element_text(size = 20), # change size of axis title
axis.title.y = element_text(size = 20),
plot.title = element_text(hjust = 0.5, size = 25),
legend.position = "None",
panel.grid.major = element_line(colour = "grey"), # Major grid lines
panel.background = element_rect(fill = "white", colour = "black"),
aspect.ratio = 0.3)
To compute the correlations between gene expression and ArchR gene activity scores I first aggregated cells according to celltypes to compute correlations. As can be seen in the plot below, this yields very high correlation values. This is as expected, since in (Granja JM 2021) the authors compared 52 different ways of computing gene activity scores from ATAC-seq data and found their method to be the best one.
name <- "ArchR_scores, Celltype aggregates"
archr_scores_agg <- create_celltype_aggregates(archr_colData, archr_scores_sub,
unique(archr_colData$celltypes))
expr_agg_celltypes <- create_celltype_aggregates(gene_colData, expr_mat_sub,
unique(gene_colData$celltypes))
p2g_agg_celltypes <- create_celltype_aggregates_p2g_scores(gene_colData, p2g_scores,
unique(gene_colData$celltypes))
archr_corrs <- rowwise_correlations(expr_agg_celltypes, archr_scores_agg, "Correlation gene expr. & ArchR gene activity scores")
p2g_corrs <- rowwise_correlations(expr_agg_celltypes, p2g_agg_celltypes, "Correlation gene expr. & p2g activity scores")
cowplot::plot_grid(archr_corrs[[2]], p2g_corrs[[2]], ncol = 1)
ggsave("Kathi/plots/Correlations_cell_aggregates.pdf")
## Saving 10 x 6 in image
ggplot() + #geom_density2d_filled(aes(x = correlations_250kb, y = corrs[1])) #+
geom_point(aes(y = archr_corrs[[1]], x = p2g_corrs[[1]])) +
geom_density_2d_filled(aes(y = archr_corrs[[1]], x = p2g_corrs[[1]]), alpha = 0.5) +
geom_line(aes(x = archr_corrs[[1]], archr_corrs[[1]]), color = "red") +
labs(y = "Correlation gene expr. & p2g activity scores",
x = "Correlation gene expr. & ArchR gene activity scores",
title = "Celltype aggregates") +
CORR_THEME
ggsave("Kathi/plots/comparing_archr_all_p2g_links.pdf")
## Saving 6 x 6 in image
Instead of using celltype aggregates as above, another option is to use SEACells as described in (Persad et al. 2022). These were computed using Python and the resulting cell aggregates (“metacells”) are used for aggregating gene expression and gene activity scores below. The correlations when using SEACells are much higher than the correlations obtained using the ArchR cell aggregates. For this reason I will use SEACells for computing correlations in the following steps.
#seacells <- seacells %>% dplyr::filter(index %in% colnames(EXPR_MAT))
stopifnot(nrow(p2g_scores) == nrow(expr_mat_sub))
seacell_p2g_agg <- create_seacell_aggregates(p2g_scores, seacells)
seacell_rna_agg <- create_seacell_aggregates(expr_mat_sub, seacells)
seacell_archr_agg <- create_seacell_aggregates(archr_scores_sub, seacells)
seacell_corr_p2g <- rowwise_correlations(seacell_rna_agg , seacell_p2g_agg,
"P2g links of entire chromosome, SEAcells" )
seacell_corr_archr <- rowwise_correlations(seacell_rna_agg, seacell_archr_agg,
"ArchR gene activity scores, SEAcells")
cowplot::plot_grid(seacell_corr_p2g[[2]] , seacell_corr_archr[[2]], ncol = 1)
ggplot() + #geom_density2d_filled(aes(x = correlations_250kb, y = corrs[1])) #+
geom_point(aes(x = seacell_corr_p2g[[1]], y = seacell_corr_archr[[1]])) +
geom_density_2d_filled(aes(x = seacell_corr_p2g[[1]], y = seacell_corr_archr[[1]]), alpha = 0.5) +
geom_line(aes(x = seacell_corr_archr[[1]], y = seacell_corr_archr[[1]]), color = "red" ) +
labs(x = "Correlation gene expr. & all p2g activity scores",
y = "Correlation gene expr. & ArchR gene activity scores",
title = "SEACells, all p2g links") +
CORR_THEME +
theme(legend.position = "None")
ggsave("Kathi/plots/seacell_aggreagates_p2g_archr.pdf")
## Saving 6 x 6 in image
Using ArchR (Granja JM 2021) I computed peak-to-gene links across the entire chromosome, but not between chromsomes. This means that a lot of correlations are found between peaks very far away from the promoter/gene they are linked to. Even though these correlations can be quite high and interactions between enhancers and promoters can occur over megabase distances, a real biological interaction becomes less likely the larger the distance is. Therefore, since wer are interested in biologically relevant and not spurious correlations. Therefore, as suggested by (Granja JM 2021), I added distance weights, such that farther away peaks linked to a gene contribute less to the gene activity score of this particular gene.
Here, I used a distance decay from the TSS, computed as follows:
\(weight = e^{-(abs(distTSS/c))}\) with \(c\) being a constant determining the exponential decay rate of the distance weights. Below I tried different rates to better understand whether we can improve the gene activity scores by giving a higher weight to close peaks than to far away peaks. As can be seen below this did not improve, the scores, but rather the scores became worse, which is probably due to the fact that most correlation values will get very small weights this way and most peaks linked with a gene, even if the correlation value is high, will not contribute to the gene activity score anymore.
**Careful: The p2g inks in ArchR are computed for peak and gene pairs which are within a certain distance from each other. However, not the real TSS of a gene is used for this, but rater the distance between start coordinate of a gene and peak start coordinate, not taking into consideration the strand directionality.
# As input for this function it is best to use only the most highly variable genes
distanc_weighted_gene_activity_scores <- function(p2g_mat_sub, geneModel = "exp(-distance/5000)",
weight = 50000,
peak_mat, links, p2g_original, gene_anno){
atac_granges <- metadata(p2g_original)[[1]]
#rna_granges <- metadata(p2g_original)[[2]]
gene_anno_original <- gene_anno
# create gene annotations with start coordinate of each gene
# subset to contain only genes which are included in our peak2gene matrix
gene_anno <- gene_anno %>% as.data.frame() %>%
mutate(idxRNA = seq(nrow(.))) %>%
dplyr::filter(name %in% rownames(p2g_mat_sub)) %>%
mutate(strand = ifelse(strand == 1, "+", "-")) %>%
mutate(start_coord = ifelse(strand == "+", start, end)) %>%
rename(gene = name) #%>% GRanges()
# subset atac granges & get middle of each peak
pos_atac_granges <- atac_granges %>%
as.data.frame() %>%
mutate(idxATAC = seq(nrow(.))) %>%
# group_by(seqnames) %>%
# mutate(idx = seq_along(seqnames)) %>%
# ungroup %>%
#tidyr::unite(chr_idx, seqnames, idx, remove = FALSE, sep = "_") %>%
dplyr::filter(idxATAC %in% colnames(p2g_mat_sub)) %>%
mutate(middle = start + 300) #%>% GRanges()
#TODO: Filter for genes!
stopifnot(length(unique(links$idxATAC)) == dim(pos_atac_granges)[[1]])
stopifnot(length(unique(links$idxRNA)) == dim(gene_anno)[[1]])
#p2g_filt <- p2g_original %>% as.data.frame() %>% filter(gene %in% rownames(p2g_mat))
# combine the three dataframes
p2g_join <- left_join(links, as.data.frame(pos_atac_granges),
by = "idxATAC")
p2g_join <- left_join(p2g_join, as.data.frame(gene_anno),
by = "idxRNA", suffix = c(".atac", ".rna"))
# compute distance and distance weights
p2g_join <- p2g_join %>%
mutate(distance = abs(start_coord - middle)) %>%
mutate(distance_weight = eval(parse(text=geneModel)))
# create distance weight matrix
p2g_dw <- sparseMatrix(i = p2g_join$idxRNA,
j = p2g_join$idxATAC,
x = p2g_join$distance_weight,
dims = c(nrow(gene_anno_original),
nrow(peak_mat)),
dimnames = list(gene_anno_original$name ,
seq.int(nrow(peak_mat)))
)
p2g_dw <- p2g_dw[as.vector(rownames(p2g_mat_sub)), colnames(p2g_mat_sub)]
# elementwise matrix multiplication
weighted_p2g_mat <- p2g_mat_sub * p2g_dw
print(paste(length(which(rowSums(weighted_p2g_mat) == 0)), "genes have only zero correlation values, so we will remove them."))
weighted_p2g_mat <- weighted_p2g_mat[rowSums(weighted_p2g_mat) != 0, ]
print(paste0("We are left with ", dim(weighted_p2g_mat)[1], " genes"))
# compute gene activity scores based on distance-weighted peak2gene matrix
weighted_scores <- gene_activity_scores(peak_mat_sub, weighted_p2g_mat)
return(weighted_scores)
}
weighted_scores <- distanc_weighted_gene_activity_scores(p2g_mat_sub = p2g_mat_sub,
geneModel = "exp(-distance/5000)",
weight = 50000,
peak_mat = PEAK_MAT,
links = links,
p2g_original = p2g,
gene_anno = GENE_ANNO)
## [1] "21 genes have only zero correlation values, so we will remove them."
## [1] "We are left with 1689 genes"
## [1] "normalized the p2g matrix"
## [1] "Computed weightes sum of peaks for each gene and cell"
## [1] "Normalized for library size"
## [1] "Exponentiated matrix"
## [1] "Divided by total activity to get value between zero and one"
Trying different distance decay rates, seacell aggreagates
SEAcells
# prepare lists to store correlation vectors and correlation histograms
corr_list <- list(seacell_corr_archr[[1]], seacell_corr_p2g[[1]])
model_list <- c("exp(-abs(distance)/5000)", "exp(-abs(distance)/50000)",
"exp(-abs(distance)/500000)", "exp(-abs(distance)/5000000)")
# compute the distance-weighted gene activity scores from p2g links using different
# distance weight models
for (model in model_list){
weighted_scores <- distanc_weighted_gene_activity_scores(p2g_mat_sub,
geneModel = model,
weight = 50000,
peak_mat = PEAK_MAT,
links = links,
p2g_original = p2g,
gene_anno = GENE_ANNO)
agg_dist <- create_seacell_aggregates(weighted_scores, seacells)
dist_knn <- rowwise_correlations(seacell_rna_agg, agg_dist, name = paste0("P2g activity scores, distance weighted, model = ", model))
stopifnot(any(is.na(dist_knn)) == FALSE)
corr_list <- append(corr_list, dist_knn[[1]])
print(dist_knn[[2]])
ggsave(paste0("Kathi/plots/corr_distance_weights_", model, "_seacells.png"))
plot <- ggplot() + #geom_density_2d_filled(aes(x = corr_list[[i]],
# y = corr_list[[1]]), alpha = .5) +
geom_point(aes(x = dist_knn[[1]],
y = corr_list[[1]][names(dist_knn[[1]])])) +
geom_density_2d_filled(aes(
x = dist_knn[[1]],
y = corr_list[[1]][names(dist_knn[[1]])] ),
alpha = 0.5) +
geom_line(aes(
x = corr_list[[1]][names(dist_knn[[1]])],
y = corr_list[[1]][names(dist_knn[[1]])]), col = "red") +
CORR_THEME +
labs(x = "Correlation gene expr. & p2g activity scores",
title = paste0(model, "SEACells"),
y = "Correlation gene expr. & ArchR gene activity scores")
print(plot)
ggsave(paste0("Kathi/plots/comp_corr_distance_weights_", model, "_seacells.png"))
}
## [1] "21 genes have only zero correlation values, so we will remove them."
## [1] "We are left with 1689 genes"
## [1] "normalized the p2g matrix"
## [1] "Computed weightes sum of peaks for each gene and cell"
## [1] "Normalized for library size"
## [1] "Exponentiated matrix"
## [1] "Divided by total activity to get value between zero and one"
## Saving 6 x 6 in image
## Saving 6 x 6 in image
## [1] "0 genes have only zero correlation values, so we will remove them."
## [1] "We are left with 1710 genes"
## [1] "normalized the p2g matrix"
## [1] "Computed weightes sum of peaks for each gene and cell"
## [1] "Normalized for library size"
## [1] "Exponentiated matrix"
## [1] "Divided by total activity to get value between zero and one"
## Saving 6 x 6 in image
## Saving 6 x 6 in image
## [1] "0 genes have only zero correlation values, so we will remove them."
## [1] "We are left with 1710 genes"
## [1] "normalized the p2g matrix"
## [1] "Computed weightes sum of peaks for each gene and cell"
## [1] "Normalized for library size"
## [1] "Exponentiated matrix"
## [1] "Divided by total activity to get value between zero and one"
## Saving 6 x 6 in image
## Saving 6 x 6 in image
## [1] "0 genes have only zero correlation values, so we will remove them."
## [1] "We are left with 1710 genes"
## [1] "normalized the p2g matrix"
## [1] "Computed weightes sum of peaks for each gene and cell"
## [1] "Normalized for library size"
## [1] "Exponentiated matrix"
## [1] "Divided by total activity to get value between zero and one"
## Saving 6 x 6 in image
## Saving 6 x 6 in image
There are two options when defining the gene window. One option is to extend +/- 100bp up- and downstream of the TSS. However, since genes have different sizes, some gene bodies might be much larger than these gene windows. The second option is to extend the gene window not from the TSS, but from the start and end corrdinate of the gene body respectively. This way, more peaks will be taken into consideration if a gene is larger, simply because the gene window will be larger. Therefore, in ArchR they use an additional weight for the gene body size to account for this effect. Here, we extend the gene window around the TSS. As can be seen in the plot below, this does not yield better results, probably, because we are removing a lot of correlations which are high and, therefore, important for the prediction.
This is not what would be expected, since some high correlations within the gene window are very likely to be biologically important and should recapitulate gene expression quite well. This is also shown by the ArchR gene activity scores, which use gene window as well to restric the influence of accessible regions to a certain window around the gene’s TSS. One reason could be that the peak-to-gene links identified by simple correlations are not biologically meaningful, therefore also very far away correlations are important for recapitulating gene expression.
# As input for this function it is best to use only the most highly variable genes
compute_gene_window_score <- function(p2g_mat_sub, peak_mat, links, p2g_original, gene_anno){
gene_anno_original <- gene_anno
# create gene annotations with start coordinate of each gene
# subset to contain only genes which are included in our peak2gene matrix
gene_anno <- gene_anno %>%
as.data.frame() %>%
mutate(idxRNA = seq(nrow(.))) %>%
dplyr::filter(name %in% rownames(p2g_mat_sub)) %>%
mutate(strand = ifelse(strand == 1, "+", "-")) %>%
mutate(start_coord = ifelse(strand == "+", start, end)) %>%
rename(gene = name) #%>% GRanges()
# extend gene regions +/- 100bp up- and downstream of the TSS
gene_regions <- resize(gene_anno %>% GRanges(), width = 1)
extendedGeneRegion <- (suppressWarnings(extendGR(gene_regions,
upstream = 100000,
downstream = 100000)))
# subset atac granges & get middle of each peak
pos_atac_granges <- metadata(p2g_original)[[1]] %>%
as.data.frame() %>%
mutate(idxATAC = seq(nrow(.))) %>%
# group_by(seqnames) %>%
# mutate(idx = seq_along(seqnames)) %>%
# ungroup %>%
#tidyr::unite(chr_idx, seqnames, idx, remove = FALSE, sep = "_") %>%
dplyr::filter(idxATAC %in% colnames(p2g_mat_sub)) %>%
mutate(middle = start + 300) #%>% GRanges()
#TODO: Filter for genes!
stopifnot(length(unique(links$idxATAC)) == dim(pos_atac_granges)[[1]])
stopifnot(length(unique(links$idxRNA)) == dim(gene_anno)[[1]])
#p2g_filt <- p2g_original %>% as.data.frame() %>% filter(gene %in% rownames(p2g_mat))
# find overlapping peaks and gene window in chromosome-aware fashion
tmp <- suppressWarnings(findOverlaps(extendedGeneRegion, pos_atac_granges %>% GRanges()))
print(paste0("Out of ", subjectLength(tmp), " peaks only ",
length(unique(subjectHits(tmp))), " peaks are found within gene window of 200kb."))
### some plots
p1 <- (tmp %>% as.data.frame() %>%
group_by(queryHits) %>% # gene region
summarize(n = n()) %>% # get number of peaks overlapping with a gene region
ggplot() + geom_histogram(aes(x = n), bins = 100, fill="#69b3a2") +
labs(title = "Gene window of size +/- 100kb from TSS",
x = "#peaks within gene window")) +
CORR_THEME
# combine the three dataframes
p2g_join <- left_join(links, as.data.frame(pos_atac_granges),
by = "idxATAC")
p2g_join <- left_join(p2g_join, as.data.frame(gene_anno),
by = "idxRNA", suffix = c(".atac", ".rna"))
# compute distance and distance weights
p2g_join <- p2g_join %>%
mutate(distance = abs(start_coord - middle))
p2 <- p2g_join %>% ggplot() +
geom_histogram(aes(x = distance), bins = 100, fill="#69b3a2") +
labs(title = "P2g links within +/- 100kb window", x = "Distance (bp)") +
geom_vline(xintercept = 100000, color = "orange") +
CORR_THEME
# p2 <- p2g_join %>% ggplot() +
# geom_histogram(aes(x = (distance_weight)), bins = 100) +
# scale_y_log10() +
# labs(title = "Distance Weights", x = "distance weights")
print(cowplot::plot_grid(p1, p2, ncol = 2))#), ncol = 2))
ggsave("Kathi/plots/gene_window_distance.pdf")
# create a dataframe of all peaks which overlap their corresponding gene window
peaks_in_gene_window <- data.frame(gene = gene_regions[queryHits(tmp)]$gene,
peak = (pos_atac_granges %>% GRanges())[subjectHits(tmp)]$idxATAC) %>%
unite(peak_gene_window, gene, peak, sep = "#", remove = FALSE)
# filter the p2g link dataframe for only peaks which are within a gene window
corr_window <- p2g_join %>%
unite(peak_gene_window, gene, idxATAC, sep = "#", remove = FALSE) %>%
dplyr::filter(peak_gene_window %in% peaks_in_gene_window$peak_gene_window)
### PLOTS
p1 <- corr_window %>%
ggplot() +
geom_histogram(aes(x = Correlation), bins = 200, fill = "#69b3a2") +
labs(title = "P2g links within +/- 100kb gene window") +
THEME_LONG
p2 <- corr_window %>%
ggplot() +
geom_histogram(aes(x = distance), bins = 200, fill = "#69b3a2") +
labs(x = "Distance (bp)") +
THEME_LONG
p3 <- corr_window %>%
mutate(bin = cut_width(distance, width=10000, boundary=0)) %>%
ggplot() +
geom_violin(aes(x = bin, y = Correlation), fill = "#69b3a2", alpha = .6, scale = "width") +
geom_boxplot(aes(x = bin, y = Correlation), alpha = 0) +
labs(title = "P2g links within +/- 100kb gene window",
x = "Distance (10kp bins)") +
scale_x_discrete(guide = guide_axis(angle = 45)) +
THEME_LONG
print(cowplot::plot_grid(p1, p2, ncol = 1))
ggsave("Kathi/plots/distance_correlation.pdf")
print(p3)
ggsave("Kathi/plots/Distance_vs_corr_gene_window.pdf")
# p1 <- ggplot() +
# geom_histogram(aes(x = rowSums(p2g_mat_sub > 0)), bins = 200, fill = "#69b3a2") +
# scale_y_log10() +
# labs(title = "# peaks correlated with each gene",
# x = "number of peaks", y = "log10(count)")
#
#
# p2 <- ggplot() +
# geom_histogram(aes(x = colSums(p2g_mat_sub > 0)), bins = 70, fill = "#69b3a2") +
# scale_y_log10() +
# labs(title = "# genes correlated with each peak",
# y = "log10(count)", x = "number of genes")
p3 <- ggplot() +
geom_histogram(aes(x = rowSums(p2g_mat_sub > 0)), bins = 200, fill = "#69b3a2") +
labs(title = "# peaks correlated with each gene",
x = "number of peaks", y = "count") +
CORR_THEME
p4 <- ggplot() +
geom_histogram(aes(x = colSums(p2g_mat_sub > 0)), bins = 70, fill = "#69b3a2") +
labs(title = "# genes correlated with each peak",
y = "count", x = "number of genes") +
CORR_THEME
print(cowplot::plot_grid(p3, p4, ncol = 2))
ggsave("Kathi/plots/genes_peaks_correlations.pdf")
# Create a matrix of distance weight
p2g_links_gene_window <- Matrix::sparseMatrix(
i = corr_window$idxRNA,
j = corr_window$idxATAC,
x = corr_window$Correlation,
dims = c(nrow(gene_anno_original), nrow(peak_mat)),
dimnames = list(gene_anno_original$name,
seq.int(nrow(peak_mat)))
)
print(paste0("The peak-to-gene links matrix, restricted to a +/- 100kb window around the TSS has dimensions ", split(dim(p2g_links_gene_window), 1)))
print(paste0("The maximum value is: ", max(p2g_links_gene_window), ", the minimum value is: ", min(p2g_links_gene_window) ))
# remove any rows or columns with zeros
p2g_links_gene_window <- p2g_links_gene_window[rowSums(p2g_links_gene_window) != 0, ]
p2g_links_gene_window <- p2g_links_gene_window[, colSums(p2g_links_gene_window) != 0]
print(paste0("After removing any rows and columsn which do not contain any links we are left with ", nrow(p2g_links_gene_window), " genes and ", ncol(p2g_links_gene_window), " peaks."))
# Compute gene activity scores with weighted distance matix
gene_window_scores <- gene_activity_scores(peak_mat_sub[colnames(p2g_links_gene_window), ], p2g_links_gene_window)
return(gene_window_scores)
}
gene_window_scores <- compute_gene_window_score(
p2g_mat_sub = p2g_mat_sub,
peak_mat = PEAK_MAT,
links = links,
p2g_original = p2g,
gene_anno = GENE_ANNO)
## [1] "Out of 139966 peaks only 29829 peaks are found within gene window of 200kb."
## Saving 10 x 8 in image
## Saving 10 x 8 in image
## Saving 10 x 8 in image
## Saving 10 x 8 in image
## [1] "The peak-to-gene links matrix, restricted to a +/- 100kb window around the TSS has dimensions c(22273, 192251)"
## [1] "The maximum value is: 0.976679757235753, the minimum value is: 0"
## [1] "After removing any rows and columsn which do not contain any links we are left with 1512 genes and 11639 peaks."
## [1] "normalized the p2g matrix"
## [1] "Computed weightes sum of peaks for each gene and cell"
## [1] "Normalized for library size"
## [1] "Exponentiated matrix"
## [1] "Divided by total activity to get value between zero and one"
Second, I compared the distance weigths using the SEACell aggregates, which yields better results as can be seen above.
gene_window_agg <- create_seacell_aggregates(gene_window_scores, seacells)
gene_window_corr <- rowwise_correlations(seacell_rna_agg, gene_window_agg,
name = "Gene window around TSS")
gene_window_corr[[2]]
ggsave("Kathi/plots/correlation_p2g_gene_window_archr.pdf")
## Saving 6 x 6 in image
ggplot() +
geom_point(aes(x = gene_window_corr[[1]],
y = seacell_corr_archr[[1]][names(gene_window_corr[[1]])])) +
geom_density_2d_filled(aes(
y = seacell_corr_archr[[1]][names(gene_window_corr[[1]])],
x = gene_window_corr[[1]]), alpha = 0.5) +
geom_line(aes(x = seacell_corr_archr[[1]][names(gene_window_corr[[1]])],
y = seacell_corr_archr[[1]][names(gene_window_corr[[1]])]),
color = "red") +
CORR_THEME +
labs(x = "Correlation gene expr. & p2g activity scores",
y = "Correlation gene expression & ArchR gene activity scores",
title = "Peak-to-gene links within gene window")
ggsave("Kathi/plots/compare_correlation_p2g_gene_window_archr.pdf")
## Saving 6 x 6 in image
How does the distance weight distribution change with different decay rates?
Here, we use the formula \(e^{\frac{-abs(distance)}{c}}\) with differen decay rates \(c \in \{5000, 50000, 500000, 5000000\}\). Additionally, we use only peaks which overlap with a +/- 100kb window from the TSS.
model_list <- c("exp(-abs(distance)/5000)", "exp(-abs(distance)/50000)",
"exp(-abs(distance)/500000)", "exp(-abs(distance)/5000000)")
atac_granges <- metadata(p2g)[[1]]
#rna_granges <- metadata(p2g_original)[[2]]
gene_anno <- GENE_ANNO
# create gene annotations with start coordinate of each gene
# subset to contain only genes which are included in our peak2gene matrix
gene_anno <- gene_anno %>% as.data.frame() %>%
mutate(idxRNA = seq(nrow(.))) %>%
dplyr::filter(name %in% rownames(p2g_mat_sub)) %>%
mutate(strand = ifelse(strand == 1, "+", "-")) %>%
mutate(start_coord = ifelse(strand == "+", start, end)) %>%
rename(gene = name) #%>% GRanges()
# subset atac granges & get middle of each peak
pos_atac_granges <- atac_granges %>%
as.data.frame() %>%
mutate(idxATAC = seq(nrow(.))) %>%
# group_by(seqnames) %>%
# mutate(idx = seq_along(seqnames)) %>%
# ungroup %>%
#tidyr::unite(chr_idx, seqnames, idx, remove = FALSE, sep = "_") %>%
dplyr::filter(idxATAC %in% colnames(p2g_mat_sub)) %>%
mutate(middle = start + 300) #%>% GRanges()
# combine the three dataframes
p2g_join <- left_join(links, as.data.frame(pos_atac_granges),
by = "idxATAC")
p2g_join <- left_join(p2g_join, as.data.frame(gene_anno),
by = "idxRNA", suffix = c(".atac", ".rna"))
# compute distance and distance weights
p2g_join <- p2g_join %>%
mutate(distance = abs(start_coord - middle)) %>%
mutate(rel_distance = start_coord - middle)
# mutate(distance_weight = eval(parse(text=geneModel)))
distribution of distance weights using different distance decay rates
for (model in model_list){
# compute distance and distance weights
p2g_join <- p2g_join %>%
mutate(distance = abs(start_coord - middle)) %>%
mutate(distance_weight = eval(parse(text=model)))
p1 <- p2g_join %>% ggplot() +
geom_histogram(aes(x = distance), bins = 200, fill="#69b3a2") +
labs(title = "Distance between peaks and genes", x = "distance") +
geom_vline(xintercept = 5000, color = "red") +
geom_vline(xintercept = 250000, color = "orange") +
theme(axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
axis.title.x = element_text(size = 20), # change size of axis title
axis.title.y = element_text(size = 20),
plot.title = element_text(hjust = 0.5, size = 20),
legend.position = "None",
panel.grid.major = element_line(colour = "grey"), # Major grid lines
panel.background = element_rect(fill = "white", colour = "black"),
aspect.ratio = 1.2)
p2 <- p2g_join %>% ggplot() +
geom_histogram(aes(x = (distance_weight)), bins = 200, fill="#69b3a2") +
scale_y_log10() +
labs(title = paste0(model),
x = "distance weights", y = "log10(counts)") +
theme(axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
axis.title.x = element_text(size = 20), # change size of axis title
axis.title.y = element_text(size = 20),
plot.title = element_text(hjust = 0.5, size = 20),
legend.position = "None",
panel.grid.major = element_line(colour = "grey"), # Major grid lines
panel.background = element_rect(fill = "white", colour = "black"),
aspect.ratio = 1.2)
print(cowplot::plot_grid(p1, p2, ncol = 2))
ggsave(paste0("Kathi/plots/distance_weights", model, ".pdf"))
}
## Saving 10 x 6 in image
## Saving 10 x 6 in image
## Saving 10 x 6 in image
## Saving 10 x 6 in image
# Relationship between distance and correlation value
# p3 <- p2g_join %>% ggplot() +
# geom_point(aes(x = Correlation, y = distance)) +
# labs(title = "Distance vs. correlation between peaks and genes",
# x = "Correlation between peak and gene",
# y = "Distance between peak and gene")
#
#
# p4 <- p2g_join %>% ggplot() +
# geom_point(aes(x = Correlation, y = distance_weight)) +
# labs(title = "Distance vs. correlation between peaks and genes",
# x = "Correlation between peak and gene",
# y = "Distance weights between peak and gene")
#cowplot::plot_grid(p1, p2, ncol = 1)
# Olot relationship between distance and correlation as density plots
p1 <- p2g_join %>% ggplot() +
geom_density_2d_filled(aes(x = Correlation, y = distance)) +
theme(legend.position = "None") +
labs(title = "Relationship between distance and correlation")
p2 <- p2g_join %>%
filter(Correlation > 0.3) %>%
ggplot() +
geom_density_2d_filled(aes(x = Correlation, y = distance)) +
theme(legend.position = "None") +
labs(title = "Relationship between distance and correlation")
p3 <- p2g_join %>%
filter(Correlation > 0.6) %>%
ggplot() +
geom_density_2d_filled(aes(x = Correlation, y = distance)) +
theme(legend.position = "None") +
labs(title = "Relationship between distance and correlation")
cowplot::plot_grid(p1, p2, p3, ncol = 2)
p2g %>%
mutate(bin=cut_width(distance, width=100000, boundary=0)) %>%
filter(distance < 10000000) %>%
ggplot() +
geom_boxplot(aes(x = bin, y = Correlation), fill="#69b3a2") +
#geom_vline(xintercept = 250000, color = "red") +
labs(title = "Relationship between distance and correlation of p2g links, 100kb bins",
x = "Distance between peaks and genes within 250kb", y = "Correlation") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p2g_join %>%
mutate(bin=cut_width(distance, width=100000, boundary=0)) %>%
dplyr::filter(distance < 10000000) %>%
ggplot() +
geom_violin(aes(x = bin, y = Correlation), fill = "#69b3a2", alpha = .6, scale = "width") +
geom_boxplot(aes(x = bin, y = Correlation), alpha = 0) +
#geom_vline(xintercept = 250000, color = "red") +
labs(title = "P2g links, entire chromsome",
x = "Distance (100kb)", y = "Correlation") +
CUSTOM_THEME
#theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
ggsave("Kathi/plots/relationship_dist_corr_10Mb.pdf")
## Saving 15 x 6 in image
Figures for Report
p1 <- p2g_join %>%
mutate(bin=cut_width(distance, width=100000, boundary=0)) %>%
dplyr::filter(distance < 10000000 & Correlation > 0.5) %>%
ggplot() +
geom_violin(aes(x = bin, y = Correlation), fill = "#69b3a2", alpha = .6, scale = "width") +
geom_boxplot(aes(x = bin, y = Correlation), alpha = 0) +
labs(title = "Relationship between distance and correlation of p2g links, 100kb bins",
x = "Distance < 1e^7 bp", y = "Correlation > 0.5") +
scale_fill_brewer(palette="Set3") +
CUSTOM_THEME
#theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
print(p1)
ggsave("Kathi/plots/Relationship_dist_corr_diff_values_1.pdf")
## Saving 12 x 6 in image
p2 <- p2g_join %>%
mutate(bin=cut_width(distance, width=100000, boundary=0)) %>%
dplyr::filter(distance < 10000000 & Correlation > 0.8) %>%
ggplot() +
geom_violin(aes(x = bin, y = Correlation), fill = "#69b3a2", alpha = .6, scale = "width") +
geom_boxplot(aes(x = bin, y = Correlation), alpha = 0) +
labs(title = "Relationship between distance and correlation of p2g links, 100kb bins",
x = "Distance < 1e^7 bp", y = "Correlation > 0.8") +
scale_fill_brewer(palette="Set3") +
CUSTOM_THEME
print(p2)
ggsave("Kathi/plots/Relationship_dist_corr_diff_values_2.pdf")
## Saving 12 x 6 in image
p3 <- p2g_join %>%
mutate(bin=cut_width(distance, width=100000, boundary=0)) %>%
dplyr::filter(distance < 10000000 & Correlation < 0.5) %>%
ggplot() +
geom_violin(aes(x = bin, y = Correlation), fill = "#69b3a2", alpha = .6, scale = "width") +
geom_boxplot(aes(x = bin, y = Correlation), alpha = 0) +
labs(title = "Relationship between distance and correlation of p2g links, 100kb bins",
x = "Distance < 1e^7 bp", y = "Correlation < 0.5") +
scale_fill_brewer(palette="Set3") +
CUSTOM_THEME
print(p3)
ggsave("Kathi/plots/Relationship_dist_corr_diff_values_3.pdf")
## Saving 12 x 6 in image
p4 <- p2g_join %>%
mutate(bin=cut_width(distance, width=1000, boundary=0)) %>%
dplyr::filter(distance < 100000) %>% # & Correlation > 0.5) %>%
ggplot() +
geom_violin(aes(x = bin, y = Correlation), fill = "#69b3a2", alpha = .6, scale = "width") +
geom_boxplot(aes(x = bin, y = Correlation), alpha = 0) +
labs(title = "Relationship between distance and correlation of p2g links, 1kb bins",
x = "Distance < 100kb") +
scale_fill_brewer(palette="Set3") +
CUSTOM_THEME
print(p4)
ggsave("Kathi/plots/Relationship_dist_corr_diff_values_4.pdf")
## Saving 12 x 6 in image
```#{r, fig.width=15, fig.height=15} cowplot::plot_grid(p1, p2, p3, p4, ncol = 2)
```
Lets have a look at correlation values between peaks within the promoter region of a TSS, namely 5kb upstream of the TSS.
p2g_join %>%
mutate(bin=cut_width(rel_distance, width=100, boundary=0)) %>%
dplyr::filter(rel_distance < 0 & rel_distance >= -5000) %>%
ggplot() +
geom_violin(aes(x = bin, y = Correlation), fill = "#69b3a2", alpha = .6, scale = "width") +
geom_boxplot(aes(x = bin, y = Correlation), alpha = 0) +
labs(title = "Distance -5kb upstream of TSS, 100bp bins",
x = "Distance -5kb upstream of TSS", y = "Correlation") +
scale_fill_brewer(palette="Set3") +
CUSTOM_THEME
ggsave("Kathi/plots/distance_within_promoter.pdf")
## Saving 10 x 6 in image
In case Hi-C data are available, TAD boundaries could aid in finding peak-to-gene links. Setting a distance decay rate to the same value for all genes and celltypes, does not give credit to the biological variability associated with gene regulation. In (Zuin J. 2022) it has been shown experimentally, that interactions between regulatory elements decay exponentially within TAD boundaries and almost disappear completely beyond TAD boundaries. Here, I restricted the peak-to-gene links identified by ArchR to within TAD boundaries and computed gene activity scores again.
tad_boundaries <- as.data.frame(read.table("Kathi/tad_e75.bed", header = FALSE, sep = "\t", stringsAsFactors = FALSE, quote = ""))
tad_boundaries <- tad_boundaries %>%
rename(seqnames = V1, start = V2, end = V3) %>%
GRanges()
p1 <- ggplot() + geom_histogram(aes(x = width(gene_anno %>% GRanges())),
bins = 200) +
geom_vline(xintercept = median(width(gene_anno %>% GRanges())),
color = "orange") +
labs(title = paste0("Distribution of gene size, median size = ",
median(width(gene_anno %>% GRanges())), " bp"),
x = "gene size (bp)") +
THEME_LONG
p2 <- ggplot() + geom_histogram(aes(x = width(tad_boundaries)), bins = 200) +
geom_vline(xintercept = median(width(tad_boundaries)), color = "orange") +
labs(title = paste0("Distribution of TAD boundary size, median size = ",
median(width(tad_boundaries)), " bp"),
x = "TAD boundary size (bp)") +
THEME_LONG
cowplot::plot_grid(p1, p2, ncol = 1)
ggsave("Kathi/plots/tad_gene_size_distribution.pdf")
## Saving 10 x 6 in image
What is the distribution of peaks and genes within TAD boundaries?
gene_anno <- GENE_ANNO
# create gene annotations with start coordinate of each gene
# subset to contain only genes which are included in our peak2gene matrix
gene_anno <- gene_anno %>% as.data.frame() %>%
mutate(idxRNA = seq(nrow(.))) %>%
dplyr::filter(name %in% rownames(p2g_mat_sub)) %>%
mutate(strand = ifelse(strand == 1, "+", "-")) %>%
mutate(start_coord = ifelse(strand == "+", start, end)) %>%
rename(gene = name) #%>% GRanges()
# subset atac granges & get middle of each peak
pos_atac_granges <- atac_granges %>%
as.data.frame() %>%
mutate(idxATAC = seq(nrow(.))) %>%
# group_by(seqnames) %>%
# mutate(idx = seq_along(seqnames)) %>%
# ungroup %>%
#tidyr::unite(chr_idx, seqnames, idx, remove = FALSE, sep = "_") %>%
dplyr::filter(idxATAC %in% colnames(p2g_mat_sub)) %>%
mutate(middle = start + 300) #%>% GRanges()
#TODO: Filter for genes!
stopifnot(length(unique(links$idxATAC)) == dim(pos_atac_granges)[[1]])
stopifnot(length(unique(links$idxRNA)) == dim(gene_anno)[[1]])
#p2g_filt <- p2g_original %>% as.data.frame() %>% filter(gene %in% rownames(p2g_mat))
# find overlapping peaks and gene window in chromosome-aware fashion
tad_overlaps_genes <- suppressWarnings(findOverlaps(gene_anno %>% GRanges(),
tad_boundaries))
p1 <- tad_overlaps_genes %>% as.data.frame() %>%
group_by(subjectHits) %>%
summarise(n = n()) %>%
ggplot() + geom_histogram(aes(x = n), bins = 100) +
labs(title = "Number of highly variable genes within a tad boundary",
x = "number of genes/tad boundary") +
CORR_THEME
tad_overlaps_peaks <- suppressWarnings(findOverlaps(pos_atac_granges %>% GRanges(),
tad_boundaries))
p2 <- tad_overlaps_peaks %>% as.data.frame() %>%
group_by(subjectHits) %>%
summarise(n = n()) %>%
ggplot() + geom_histogram(aes(x = n), bins = 100) +
labs(title = "Number of peaks within a tad boundary",
x = "number of peaks/tad boundary") +
CORR_THEME
cowplot::plot_grid(p1, p2, ncol = 2)
ggsave("Kathi/plots/peaks_genes_per_peaks.pdf")
## Saving 10 x 6 in image
Here I visualize the relationship between distance between peaks and genes and their respective correlation values using all positive links obtained using ArchR
p2g_join_all %>%
mutate(bin=cut_width(distance, width=100000, boundary=0)) %>%
filter(distance < 10000000) %>%
ggplot() +
geom_boxplot(aes(x = bin, y = Correlation), fill="#69b3a2") +
#geom_vline(xintercept = 250000, color = "red") +
labs(title = "Relationship between distance and correlation of p2g links, 100kb bins",
x = "Distance between peaks and genes within 250kb", y = "Correlation") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
colPalette_celltypes = c('#532C8A',
'#c19f70',
'#f9decf',
'#c9a997',
'#B51D8D',
'#3F84AA',
'#9e6762',
'#354E23',
'#F397C0',
'#ff891c',
'#635547',
'#C72228',
'#f79083',
'#EF4E22',
'#989898',
'#7F6874',
'#8870ad',
'#647a4f',
'#EF5A9D',
'#FBBE92',
'#139992',
'#cc7818',
'#DFCDE4',
'#8EC792',
'#C594BF',
'#C3C388',
'#0F4A9C',
'#FACB12',
'#8DB5CE',
'#1A1A1A',
'#C9EBFB',
'#DABE99',
'#65A83E',
'#005579',
'#CDE088',
'#f7f79e',
'#F6BFCB')
tad_boundaries %>% as.data.frame() %>% group_by(seqnames) %>%
summarise(n = n()) %>% ungroup() %>%
ggplot() + geom_col(aes(x = seqnames, y = n), color = "black", alpha = .6) +#, fill = seqnames), alpha = .7) +#, position = "dodge")
theme(legend.position = "None") +
scale_fill_manual(values = colPalette_celltypes) +
labs(y = "number of tad boundaries", x = "chromosome") +
CUSTOM_THEME
ggsave("Kathi/plots/tads_per_chromosome.pdf")
## Saving 10 x 6 in image
TODO: Should I also remove peaks which are across TAD boundaries?
# As input for this function it is best to use only the most highly variable genes
tad_boundaries_p2g_scores <- function(p2g_mat_sub, peak_mat, links, p2g_original, gene_anno, tad_boundaries){
atac_granges <- metadata(p2g_original)[[1]]
#rna_granges <- metadata(p2g_original)[[2]]
gene_anno_original <- gene_anno
# create gene annotations with start coordinate of each gene
# subset to contain only genes which are included in our peak2gene matrix
gene_anno <- gene_anno %>% as.data.frame() %>%
mutate(idxRNA = seq(nrow(.))) %>%
dplyr::filter(name %in% rownames(p2g_mat_sub)) %>%
mutate(strand = ifelse(strand == 1, "+", "-")) %>%
mutate(start_coord = ifelse(strand == "+", start, end)) %>%
rename(gene = name) #%>% GRanges()
# subset atac granges & get middle of each peak
pos_atac_granges <- atac_granges %>%
as.data.frame() %>%
mutate(idxATAC = seq(nrow(.))) %>%
# group_by(seqnames) %>%
# mutate(idx = seq_along(seqnames)) %>%
# ungroup %>%
#tidyr::unite(chr_idx, seqnames, idx, remove = FALSE, sep = "_") %>%
dplyr::filter(idxATAC %in% colnames(p2g_mat_sub)) %>%
mutate(middle = start + 300) #%>% GRanges()
#TODO: Filter for genes!
stopifnot(length(unique(links$idxATAC)) == dim(pos_atac_granges)[[1]])
stopifnot(length(unique(links$idxRNA)) == dim(gene_anno)[[1]])
#p2g_filt <- p2g_original %>% as.data.frame() %>% filter(gene %in% rownames(p2g_mat))
# find overlapping peaks and gene window in chromosome-aware fashion
tad_overlaps_genes <- suppressWarnings(findOverlaps(gene_anno %>% GRanges(),
tad_boundaries))
# get genes which are not found within two TAD boundaries, but only within one
within_genes <- (tad_overlaps_genes %>%
as.data.frame() %>%
group_by(queryHits) %>%
summarise(n = n()) %>% ungroup() %>%
dplyr::filter(n < 2))$queryHits
print(paste0("Out of ", nrow(gene_anno), " genes, ", length(unique(queryHits(tad_overlaps_genes))), " genes are within TAD boundaries. Some of these genes even span across TAD boudnaries, namely ", length(within_genes), "."))
# We only keep genes within boundaries, but not genes crossing boundaries
tad_overlaps_genes <- tad_overlaps_genes %>% as.data.frame %>%
dplyr::filter(queryHits %in% within_genes) #%>% S4Vectors::as()
# get peaks overlapping with tad boundaries
tad_overlaps_peaks <- suppressWarnings(findOverlaps(pos_atac_granges %>% GRanges(),
tad_boundaries))
# filter for peaks overlapping tad boundaries which also contain genes
tad_overlaps_peaks <- tad_overlaps_peaks %>% as.data.frame() %>%
dplyr::filter(subjectHits %in% tad_overlaps_genes$subjectHits)
# combine tad boundaries which contain genes and peaks
tad_combine <- left_join(tad_overlaps_genes, tad_overlaps_peaks,
copy = TRUE, by = "subjectHits", suffix = c(".gene", ".peak"))
genes <- gene_anno[tad_combine$queryHits.gene, ] %>%
mutate(tad_index = tad_combine$subjectHits)
peak_coll <- pos_atac_granges[tad_combine$queryHits.peak, ] %>%
mutate(tad_index = tad_combine$subjectHits)
gene_peak_tad_df <- left_join(genes, peak_coll, by = "tad_index", suffic = c(".gene", ".peak")) %>% unite(peak_gene_tad, gene, idxATAC, sep = "#", remove = FALSE)
### some plots
p1 <- (tad_overlaps_peaks %>% as.data.frame() %>%
group_by(subjectHits) %>% # gene region
summarize(n = n()) %>% # get number of peaks overlapping with a gene region
ggplot() + geom_histogram(aes(x = n), bins = 100, fill="#69b3a2") +
labs(title = "Number of peaks / TAD boundary",
x = "number of peaks")) +
CORR_THEME
p2 <- (tad_overlaps_genes %>% as.data.frame() %>%
group_by(subjectHits) %>% # gene region
summarize(n = n()) %>% # get number of peaks overlapping with a gene region
ggplot() + geom_histogram(aes(x = n), bins = 100, fill="#69b3a2") +
labs(title = "Number of hvg genes / TAD boundary",
x = "number of genes")) +
CORR_THEME
print(cowplot::plot_grid(p1, p2, ncol = 2))
ggsave("Kathi/plots/tad_boundaries_number_peaks_genes.pdf")
# combine the annotation dataframe with the p2g links dataframe
p2g_join <- left_join(links, as.data.frame(pos_atac_granges),
by = "idxATAC")
p2g_join <- left_join(p2g_join, as.data.frame(gene_anno),
by = "idxRNA", suffix = c(".atac", ".rna"))
# compute distance
p2g_join <- p2g_join %>%
mutate(distance = abs(start_coord - middle),
rel_distance = start_coord - middle)
# filter for the p2g links within tad boundaries
corr_tad_boundary <- p2g_join %>%
unite(peak_gene_tad, gene, idxATAC, sep = "#", remove = FALSE) %>%
dplyr::filter(peak_gene_tad %in% gene_peak_tad_df$peak_gene_tad)
### PLOTS
p1 <- corr_tad_boundary %>%
ggplot() +
geom_histogram(aes(x = Correlation), bins = 200, fill = "#69b3a2") +
labs(title = "P2g links within TAD boundaries") +
THEME_LONG
p2 <- corr_tad_boundary %>%
ggplot() +
geom_histogram(aes(x = distance), bins = 200, fill = "#69b3a2") +
labs(x = "Distance (bp)") +
THEME_LONG
p3 <- corr_tad_boundary %>%
mutate(bin = cut_width(distance, width=100000, boundary=0)) %>%
ggplot() +
geom_violin(aes(x = bin, y = Correlation), fill = "#69b3a2", alpha = .6, scale = "width") +
geom_boxplot(aes(x = bin, y = Correlation), alpha = 0) +
labs(title = "Distance vs. Correlation, within TAD boundaries",
x = "Distance (100kb bins)") +
scale_x_discrete(guide = guide_axis(angle = 45)) +
THEME_LONG
print(cowplot::plot_grid(p1, p2, ncol = 1))
ggsave("Kathi/plots/Distance_corr_tad.pdf")
print(p3)
ggsave("Kathi/plots/Distance_vs_corr.pdf")
#### PLOT
p1 <- corr_tad_boundary %>% ggplot() +
geom_histogram(aes(x = rel_distance), bins = 500, fill = "#69b3a2") +
scale_y_log10() +
labs(title = "P2g links within tad boundaries",
x = "Relative distance to TSS (bp)", y = "log10(count)") +
geom_vline(xintercept = c(100000, -100000), color = "orange") +
CORR_THEME
print(p1)
ggsave("Kathi/plots/Relative_distance_tad.pdf")
p2g_links_tad <- Matrix::sparseMatrix(
i = corr_tad_boundary$idxRNA,
j = corr_tad_boundary$idxATAC,
x = corr_tad_boundary$Correlation,
dims = c(nrow(gene_anno_original), nrow(peak_mat)),
dimnames = list(gene_anno_original$name,rownames(peak_mat))
)
print(paste0("The maximum value is: ", max(p2g_links_tad), ", the minum value is: ", min(p2g_links_tad) ))
p2g_links_tad <- p2g_links_tad[rowSums(p2g_links_tad) != 0, ]
p2g_links_tad <- p2g_links_tad[, colSums(p2g_links_tad) != 0]
print(paste0("After removing any rows and columsn which do not contain any links we are left with ", nrow(p2g_links_tad), " genes and ", ncol(p2g_links_tad), " peaks."))
# Compute gene activity scores
tad_scores <- gene_activity_scores(peak_mat_sub[colnames(p2g_links_tad), ], p2g_links_tad)
return(tad_scores)
}
gc(reset = TRUE)
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 7356894 393 21162874 1130.3 7356894 393
## Vcells 6809709670 51954 11317924632 86349.0 6809709670 51954
tad_scores <- tad_boundaries_p2g_scores(p2g_mat_sub = p2g_mat_sub,
peak_mat = PEAK_MAT,
links = links,
p2g_original = p2g,
gene_anno = GENE_ANNO,
tad_boundaries = tad_boundaries)
## [1] "Out of 1710 genes, 1522 genes are within TAD boundaries. Some of these genes even span across TAD boudnaries, namely 1433."
## Saving 10 x 6 in image
## Saving 10 x 6 in image
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Saving 10 x 6 in image
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 140 rows containing missing values (geom_bar).
## Saving 10 x 6 in image
## Warning: Transformation introduced infinite values in continuous y-axis
## Removed 140 rows containing missing values (geom_bar).
## [1] "The maximum value is: 0.976679757235753, the minum value is: 0"
## [1] "After removing any rows and columsn which do not contain any links we are left with 1346 genes and 21964 peaks."
## [1] "normalized the p2g matrix"
## [1] "Computed weightes sum of peaks for each gene and cell"
## [1] "Normalized for library size"
## [1] "Exponentiated matrix"
## [1] "Divided by total activity to get value between zero and one"
gc(reset = TRUE)
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 7770355 415.0 21162874 1130.3 7770355 415.0
## Vcells 7454864208 56876.2 11317924632 86349.0 7454864208 56876.2
gene_window_agg <- create_seacell_aggregates(tad_scores, seacells)
gene_window_corr <- rowwise_correlations(seacell_rna_agg, gene_window_agg,
name = "Gene window around TSS")
gene_window_corr[[2]]
ggplot() +
geom_point(aes(y = seacell_corr_archr[[1]][names(gene_window_corr[[1]])],
x =gene_window_corr[[1]])) +
geom_density_2d_filled(aes(y = seacell_corr_archr[[1]][names(gene_window_corr[[1]])],
x = gene_window_corr[[1]]), alpha = 0.5) +
geom_line(aes(x = seacell_corr_archr[[1]][names(gene_window_corr[[1]])],
y = seacell_corr_archr[[1]][names(gene_window_corr[[1]])],
color = "red")) +
labs(x = "Correlation gene expr. & p2g activity scores",
y = "Correlation gene expr. & ArchR gene activity scores",
title = "P2g links within TAD boundaries") +
CORR_THEME
ggsave("Kathi/plots/compare_tad_all_links.pdf")
## Saving 10 x 6 in image
ggplot() +
geom_point(aes(y = seacell_corr_p2g[[1]][names(gene_window_corr[[1]])],
x =gene_window_corr[[1]])) +
geom_density_2d_filled(aes(y = seacell_corr_p2g[[1]][names(gene_window_corr[[1]])],
x = gene_window_corr[[1]]), alpha = 0.5) +
geom_line(aes(x = seacell_corr_p2g[[1]][names(gene_window_corr[[1]])],
y = seacell_corr_p2g[[1]][names(gene_window_corr[[1]])],
color = "red")) +
labs(x = "Correlation gene_expr. & p2g links within TAD boundaries",
y = "Correlation gene expr. & all p2g links ",
title = "P2g links within TAD boundaries vs. all links") +
CORR_THEME
ggsave("Kathi/plots/compare_tad_archr.pdf")
## Saving 8 x 5 in image
ArchR provides a function to compute gene activity scores based on accessibility in the regions around the gene. For this a tile matrix is used. This tile matrix is a matrix where the genome is divided into bins of 500bp. If there is a Tn5 insertion in a bin the entry will be 1, if there is no insertion the entry will be 0. Importantly, they compared their function to 52 other functions and found their own function to be the best performing.
Here I tried to better understand how this function works and changed the source code of the ArchR function to also take peak matrix as input and compute the gene activity based on peaks, rather than based on tiles. Additionally, I adapted the funciton in a way such that it takes tad boundaries as input and uses all peaks which are within the same tad boundary as a gene to compute the activity scores.
There are two different options for computing gene activity scores in ArchR. First, we can use the TSS and create a gene window around it (+/- 100kp of TSS). All insertions found within tiles within this gene window will be accumulated for the gene activity scores. If we set the option ‘useGeneBoundaries=TRUE’ then we will make sure that no regions overlap between any two genes. If the gene window of one gene overlaps with the gene window of another gene, those tiles are not considered anymore. The disadvantage of this approac is that genes can be very large (>100bp), meaning that in some cases the 100kp extension downstream of the TSS would not even contain the entire gene body.
Second, we can use the entire gene body and extend the gene window beyond the start and end coordinates of the gene body. Importantly, the gene body is extended 5kb upstream of the TSS, to also include the promoter region. Using the entire gene body instead of only the TSS can be achieved by setting ‘useTSS=FALSE’. In this approach the gene window is created by extending -100kb upstream of the TSS -5kb and +100kb downstream of the gene end coordinate. This way, the entire gene body will be included in the gene window. An unwanted consequence of this might be that very large genes could bias the gene activity scores. Therefore ArchR introduces a weight for the inverse of the gene body size according to:
\(w = \frac{1}{gene size}\) with \(w\) being the inverse of the gene size. $
geneRegions\(geneWeight = 1 + w * (geneScaleFactor - 1) / (max(w) - min(w))\). The default geneScaleFactor is 5.
Additionally, ArchR uses a distance weight. Farther away tiles/peaks are less likely to interact with a TSS than closer tiles/peaks. If the first approach, using only the TSS, the distance weights are computed as follows:
\(weight = e^{-(abs(distTSS/5000))}\) with \(distTSS\) being the distance from the TSS. This way the weights decay exponentially with distance. The constant value of \(5000\) is a parameter which could be optimized for different genes or datasets, but here we will keep it constant.
In case the entire gene body is used, the distance weights are kept constant for all tiles/peaks within the gene body and only decay beyond the gene body.
\(weight = \begin{cases} if (-5kb from TSS, TTS): 1 + e^{-1} \\ else: e^{-abs(distGB/5000) + e^{-1}} \end{cases}\)
Instead of using a +/-100kb window around the gene body, in the adapted function all peaks which are within the same TAD boundary as the gene of interest are considered for the activity score of that gene. The distance weight with c = 5000 is kept the same as for the default ArchR function. As can be seen below, extending the gene window to TAD boundaries yields very similar results compared to the default ArchR function.
proj <- loadArchRProject("Kathi/06_gene_activity_scores/")
proj <- addTADGeneScoreMatrix(
proj,
genes = getGenes(proj),
peaks = getPeakSet(proj),
tadBoundaries = tad_boundaries,
geneModel = "exp(-abs(x)/500000) + exp(-1)",
matrixName = "GeneScoreMatrix_tads_500k",
extendUpstream = c(1000, 100000),
extendDownstream = c(1000, 100000),
geneUpstream = 5000, #New Param
#geneDownstream = 0, #New Param
useGeneBoundaries = FALSE,
useTSS = FALSE, #New Param
extendTSS = FALSE,
tileSize = 500,
ceiling = 4,
geneScaleFactor = 5, #New Param
scaleTo = 10000,
excludeChr = c("chrY", "chrX", "chrM"),
blacklist = getBlacklist(proj),
threads = 1,
parallelParam = NULL,
subThreading = TRUE,
force = TRUE,
logFile = createLogFile(".addTADGeneScoreMat"))
scores <- getMatrixFromProject(proj, useMatrix = "GeneScoreMatrix")
score_mat <- assays(scores)[[1]]
rownames(score_mat) <- rowData(scores)$name
#saveRDS(scores, "tad_scores")
ggplot() +
geom_histogram(aes(x = rowSums(score_mat)), bins = 200)
# compute aggregates of ArchR gene activity score matrix
default_archr <- create_seacell_aggregates(archr_scores_mat,
seacells)
# compute aggregates for tad boundary ArchR gene activity score matrix
tad_archr <- create_seacell_aggregates(score_mat, seacells)
# compute aggregates of gene expression matrix
rna_hvg <- create_seacell_aggregates(expr_sub, seacells)
# correlation between gene expression values and default Archr gene activity scores
default_archr_corr <- rowwise_correlations(rna_hvg, default_archr,
"ArchR gene activity scores, SEAcells")
# correlation between gene expression and TAD boundary gene activity scores
tad_corr <- rowwise_correlations(rna_hvg, tad_archr, "ArchR gene activity scores within TAD boundaries, SEACells")
cowplot::plot_grid(default_archr_corr[[2]], tad_corr[[2]], ncol = 2)
ggplot() +
geom_point(aes(x = tad_corr[[1]], y = default_archr_corr[[1]][names(tad_corr[[1]])])) +
geom_density_2d_filled(aes(x = tad_corr[[1]],
y = default_archr_corr[[1]][names(tad_corr[[1]])]),
alpha = .5) +
geom_line(aes(x = default_archr_corr[[1]], y = default_archr_corr[[1]]), col = "red") +
theme(legend.position = "None") +
labs(x = "Correlation gene expression & ArchR TAD boundary scores",
title = "Restricting ArchR scores to within TAD boundaries",
y = "Correlation gene expression & ArchR gene activity scores")
Since the TAD boundaries used here, are from gastrulation day E7.5. For the later time points no TAD boundaries are available. Therefore, in the following I will check if the results improve in comparison to the default ArchR function when using only data from E7.5. Since during gastrulation TAD boundaries might still be very dynamic the improving effect of TAD boundaries could be diluted by later time points in the data.
What are th genes which get zero activity scores? Do they lie outside the TAD boundaries?
e75_meta <- colData(scores) %>% as.data.frame() %>%
filter(Sample %in% c("E7.5_rep1", "E7.5_rep2")) %>%
rownames_to_column("cell")
mat_75 <- score_mat[rownames(score_mat) %in% rownames(expr_sub), e75_meta$cell]
seacells_sub <- seacells %>% filter(index %in% colnames(mat_75))
# compute aggregates of ArchR gene activity score matrix
default_archr <- create_seacell_aggregates(archr_scores_mat[rownames(archr_scores_mat) %in%
rownames(expr_sub),
e75_meta$cell],
seacells_sub)
# compute aggregates for tad boundary ArchR gene activity score matrix
tad_archr <- create_seacell_aggregates(mat_75, seacells_sub)
# compute aggregates of gene expression matrix
rna_hvg <- create_seacell_aggregates(expr_sub[, e75_meta$cell], seacells_sub)
# correlation between gene expression values and default Archr gene activity scores
default_archr_corr <- rowwise_correlations(rna_hvg, default_archr,
"ArchR gene activity scores, SEAcells")
# correlation between gene expression and TAD boundary gene activity scores
tad_corr <- rowwise_correlations(rna_hvg, tad_archr, "ArchR gene activity scores within TAD boundaries, SEACells")
cowplot::plot_grid(default_archr_corr[[2]], tad_corr[[2]], ncol = 2)
ggplot() +
geom_point(aes(x = tad_corr[[1]], y = default_archr_corr[[1]][names(tad_corr[[1]])])) +
geom_density_2d_filled(aes(x = tad_corr[[1]],
y = default_archr_corr[[1]][names(tad_corr[[1]])]),
alpha = .5) +
geom_line(aes(x = default_archr_corr[[1]], y = default_archr_corr[[1]]), col = "red") +
theme(legend.position = "None") +
labs(x = "Correlation gene expression & ArchR TAD boundary scores",
title = "Restricting ArchR scores to within TAD boundaries",
y = "Correlation gene expression & ArchR gene activity scores")
What are the genes which get zero correlation with gene expression?
There are 8 genes which get zero correlation values between gene activity scores and gene expression. This is, because they get zero activity scores in all cells. However, the same genes are expressed to certain levels according to the gene expression matrix. Two of the genes also get zero activity scores in the default ArchR function (Prl2c3, Gsdmc4). The reason for is not immediately clear, since as long as there are peaks in a gene window, the distance weight will at least be 0.36 accorindg to the formula. One reason for zero values could be that these genes lie outside TAD boundaries wich is in fact the case for four out of 8 genes.
What is the explanation why Lyz2 and Gm13547 get activity scores of zero?
zero_genes <- names(tad_corr[[1]][tad_corr[[1]] == 0])
zero_mat <- score_mat[zero_genes, ]
rowSums(zero_mat)
# check the default ArchR scores for these genes
rowSums(archr_scores_mat[zero_genes, ])
# check the gene expression coutns for these genes
rowSums(expr_mat[zero_genes,])
p2g_pos <- p2g %>% as.data.frame() %>% filter(Correlation > 0) %>%
unite(link, idxRNA, idxATAC, sep = "%", remove = FALSE)
gene_anno_all <- rowData(gene_expr) %>% as.data.frame() %>%
mutate(idxRNA = seq(nrow(.))) %>%
filter(idxRNA %in% p2g_pos$idxRNA) %>%
mutate(strand = ifelse(strand == 1, "+", "-")) %>%
mutate(start_coord = ifelse(strand == "+", start, end)) %>%
rename(gene = name) #%>% GRanges()
# subset atac granges & get middle of each peak
pos_atac_granges_all <- metadata(p2g)[[1]] %>%
as.data.frame() %>%
mutate(idxATAC = seq(nrow(.))) %>%
# group_by(seqnames) %>%
# mutate(idx = seq_along(seqnames)) %>%
# ungroup %>%
#tidyr::unite(chr_idx, seqnames, idx, remove = FALSE, sep = "_") %>%
filter(idxATAC %in% p2g_pos$idxATAC) %>%
mutate(middle = start + 300) #%>% GRanges()
# combine the three dataframes
p2g_join_all <- left_join(p2g_pos, as.data.frame(pos_atac_granges_all),
by = "idxATAC")
p2g_join_all <- left_join(p2g_join_all, as.data.frame(gene_anno_all),
by = "idxRNA", suffix = c(".atac", ".rna"))
p2g_join_all <- p2g_join_all %>%
mutate(distance = abs(start_coord - middle))
# find overlapping peaks and gene window in chromosome-aware fashion
tad_overlaps_genes <- (findOverlaps(gene_anno_all %>% GRanges(),
tad_boundaries))
# get all genes which are found within tad boudnaries
gene_anno_within_tad <- gene_anno_all[queryHits(tad_overlaps_genes),]
# Lets examine the genes which are found within tad boundaries, but
# get an activity score of zero nevertheless
gene_anno_within_tad %>% filter(gene %in% zero_genes)
gene_name = "Lyz2"
chr_name = "chr2"
chrx <- tad_boundaries %>% as.data.frame() %>% filter(seqnames == chr_name) %>%GRanges()
hits <- findOverlaps(gene_anno_all %>% filter(gene == gene_name) %>% GRanges(), chrx)
start_tad <- start(chrx[subjectHits(hits),])
end_tad <- end(chrx[subjectHits(hits),])
start_gene <- start(gene_anno_all %>% filter(gene == gene_name) %>% GRanges())
end_gene <- end(gene_anno_all %>% filter(gene == gene_name) %>% GRanges())
print(paste0("Out of ", length(zero_genes), " genes, ", length(zero_genes[zero_genes %in% gene_anno_within_tad$gene]) , " genes are found within tad boundaries, while the rest are not."))
pos_atac_granges_all %>% as.data.frame() %>% filter(seqnames == chr_name) %>%
filter(start > start_tad & end < end_tad)
#
# zero_genes
#
# idx <- (gene_anno_all %>% filter(gene %in% zero_genes))$idxRNA
#
# idx %in% gene_anno_all[tad_overlaps_genes$queryHits,
ArchR Gene Activity Scores using gene body
#saveArchRProject(ArchRProj = proj, outputDirectory = "12_Copy4/", load = FALSE)
loadArchRProject("12_activity_scores_gene_body_peaks/")
proj <- addKathiGeneScoreMatrix(
proj,
genes = getGenes(proj),
peaks = getPeakSet(proj),
geneModel = "exp(-abs(x)/5000) + exp(-1)",
matrixName = "GeneScoreMatrix",
extendUpstream = c(1000, 100000),
extendDownstream = c(1000, 100000),
#geneUpstream = 5000, #New Param
#geneDownstream = 0, #New Param
useGeneBoundaries = TRUE,
useTSS = FALSE, #New Param
extendTSS = FALSE,
tileSize = 500,
ceiling = 4,
geneScaleFactor = 5, #New Param
scaleTo = 10000,
excludeChr = c("chrY", "chrM"),
blacklist = getBlacklist(proj),
threads = 1,
parallelParam = NULL,
subThreading = TRUE,
force = TRUE,
logFile = createLogFile(".addKathiGeneScoreMat"))
scores <- getMatrixFromProject(proj, useMatrix = "GeneScoreMatrix")
scores_mat <- assays(scores)[[1]]
rownames(scores_mat) <- rowData(scores)$name
# sce <- SingleCellExperiment(list(scores=scores_mat),
# rowData = as.data.frame(rowData(scores)),
# colData = as.data.frame(colnames(scores_mat)))
#
# writeH5AD(sce, "/omics/groups/OE0533/internal/katharina/scDoRI/gastrulation_data/jupyter_notebooks/p2g_gene_activity_scores/archr_scores_gene_body_peak_based", X_name = "scores")
Correlating gene expression with activity scores:
archr_gene_body_agg <- knn_aggregates(scores_mat, cell_agg_list)
gene_body_knn <- rowwise_correlations(rna_agg, archr_gene_body_agg, "ArchR gene activity scores based on peak matrix, using gene body")
cowplot::plot_grid(archr_knn[[2]], gene_body_knn[[2]], ncol = 2)
p1 <- ggplot() + geom_density_2d_filled(aes(x = gene_body_knn[[1]],
y = archr_knn[[1]]), alpha = .5) +
geom_point(aes(x = gene_body_knn[[1]], y = archr_knn[[1]])) +
geom_line(aes(x = gene_body_knn[[1]], y = gene_body_knn[[1]]), col = "red") +
theme(legend.position = "None")
ArchR Gene Activity Scores using TSS, no gene body
proj <- loadArchRProject("12_activity_scores_TSS_tiles/")
proj <- addGeneScoreMatrix(
proj,
genes = getGenes(proj),
geneModel = "exp(-abs(x)/5000)",
matrixName = "GeneScoreMatrix",
extendUpstream = c(1000, 100000),
extendDownstream = c(1000, 100000),
#geneUpstream = 5000, #New Param
#geneDownstream = 0, #New Param
useGeneBoundaries = TRUE,
useTSS = TRUE, #New Param
extendTSS = FALSE,
tileSize = 500,
ceiling = 4,
geneScaleFactor = 5, #New Param
scaleTo = 10000,
excludeChr = c("chrY", "chrM"),
blacklist = getBlacklist(proj),
threads = 1,
parallelParam = NULL,
subThreading = TRUE,
force = TRUE,
logFile = createLogFile(".addGeneScoreMatrix"))
scores <- getMatrixFromProject(proj, useMatrix = "GeneScoreMatrix")
scores_mat <- assays(scores)[[1]]
rownames(scores_mat) <- rowData(scores)$name
# sce <- SingleCellExperiment(list(scores=scores_mat),
# rowData = as.data.frame(rowData(scores)),
# colData = as.data.frame(colnames(scores_mat)))
#
# writeH5AD(sce, "/omics/groups/OE0533/internal/katharina/scDoRI/gastrulation_data/jupyter_notebooks/p2g_gene_activity_scores/archr_scores_tss", X_name = "scores")
ArchR gene activity scores computed using TSS, no gene body and PeakMatrix instead of TileMatrix
proj <- loadArchRProject("12_activity_scores_TSS_peaks/")
proj <- addKathiGeneScoreMatrix(
proj,
genes = getGenes(proj),
peaks = getPeakSet(proj),
geneModel = "exp(-abs(x)/5000)",
matrixName = "GeneScoreMatrix",
extendUpstream = c(1000, 100000),
extendDownstream = c(1000, 100000),
#geneUpstream = 5000, #New Param
#geneDownstream = 0, #New Param
useGeneBoundaries = TRUE,
useTSS = TRUE, #New Param
extendTSS = FALSE,
tileSize = 500,
ceiling = 4,
geneScaleFactor = 5, #New Param
scaleTo = 10000,
excludeChr = c("chrY", "chrM"),
blacklist = getBlacklist(proj),
threads = 1,
parallelParam = NULL,
subThreading = TRUE,
force = TRUE,
logFile = createLogFile(".addKathiGeneScoreMat"))
scores <- getMatrixFromProject(proj, useMatrix = "GeneScoreMatrix")
scores_mat <- assays(scores)[[1]]
rownames(scores_mat) <- rowData(scores)$name
#
# sce <- SingleCellExperiment(list(scores=scores_mat),
# rowData = as.data.frame(rownames(scores_mat)),
# colData = as.data.frame(colnames(scores_mat)))
#
# writeH5AD(sce, "/omics/groups/OE0533/internal/katharina/scDoRI/gastrulation_data/jupyter_notebooks/p2g_gene_activity_scores/archr_scores_peak_based", X_name = "scores")
# sce <- SingleCellExperiment(list(p2g_mat = p2g_mat))
#
# writeH5AD(sce, "/omics/groups/OE0533/internal/katharina/scDoRI/gastrulation_data/jupyter_notebooks/p2g_gene_activity_scores/p2g_mat_250kb",
# X_name = "p2g_mat")
#
#
# sce <- SingleCellExperiment(list(peak_mat = peak_mat))
#
# writeH5AD(sce, "/omics/groups/OE0533/internal/katharina/scDoRI/gastrulation_data/jupyter_notebooks/p2g_gene_activity_scores/peak_mat",
# X_name = "peak_mat")
# cp_names <- colnames(colData(gene_expr))
# cp_names[20] <- "celltypes"
# colnames(colData(gene_expr)) <- cp_names
sce <- SingleCellExperiment(list(genes = expr_mat),
#rowData = as.data.frame(rownames(gene_expr)),
colData = as.data.frame(colData(gene_expr)))
# writeH5AD(sce, "/omics/groups/OE0533/internal/katharina/scDoRI/gastrulation_data/jupyter_notebooks/p2g_gene_activity_scores/gene_expr_mat",
# X_name = "genes")
#
#
# #p2g_mat_norm <- p2g_mat / rowSums(p2g_mat)
# scores <- p2g_mat %*% peak_mat
# scores <- t(t(scores) / colSums(scores))
# stopifnot(any(is.na(scores)) == FALSE)
# scores@x <- pmin(1e9, exp(scores@x) - 1)
#
#
#
# sce <- SingleCellExperiment(list(investigation = investigation))
#
# writeH5AD(sce, "/omics/groups/OE0533/internal/katharina/scDoRI/gastrulation_data/jupyter_notebooks/p2g_gene_activity_scores/investigation_scores",
# X_name = "investigation")
# latent embedding
emb <- getReducedDims(
ArchRProj = proj,
reducedDims = "atac_LSI_100000",
returnMatrix = TRUE,
dimsToUse = 1:30,
scaleDims = NULL,
corCutOff = 0.75
)
dim(emb)
sce <- SingleCellExperiment(list(embedding = emb))
writeH5AD(sce, "/omics/groups/OE0533/internal/katharina/scDoRI/gastrulation_data/jupyter_notebooks/p2g_gene_activity_scores/archr_lsi_embedding",
X_name = "embedding")
Granja JM, Pierce SE, Corces MR. 2021. ArchR Is a Scalable Software Package for Integrative Single-Cell Chromatin Accessibility Analysis. Nat Genet. Vols. 53(3):403-411. doi: 10.1038/s41588-021-00790-6.
Persad, Sitara, Zi-Ning Choo, Christine Dien, Ignas Masilionis, Ronan Chaligné, Tal Nawy, Chrysothemis C Brown, Itsik Pe’er, Manu Setty, and Dana Pe’er. 2022. “SEACells: Inference of Transcriptional and Epigenomic Cellular States from Single-Cell Genomics Data.” bioRxiv. Cold Spring Harbor Laboratory. https://doi.org/10.1101/2022.04.02.486748.
Zuin J., Zhan Y., Roth G. 2022. “Nonlinear Control of Transcription Through Enhancer-Promoter Interactions.” Nature. https://doi.org/10.1038/s41586-022-04570-y.